By Ethan Xu, Emily Heiss, and Brandon Quan
The purpose of this page is to guide you through the Data Science process and answer some essential questions about the COVID-19 pandemic.
Around the world, there is a vast amount of "vaccine hesitancy". The CDC defines vaccine hesitancy as individuals stating they are "unsure", "probably won't", or "definitely won't" receive the vaccine. You can look at the CDC's visualization and article to understand the extent of vaccine hesitancy in the United States, and many of you will already be familiar with the status of the vaccine debate there. But what about the rest of the world? Who's right in this debate? How do the vaccines affect the rate of cases and the severity of symptoms? Perhaps the numbers coming from the producers of the vaccines aren't enough to convince you one way or the other. We believe we can help you take a more informed position through the use of data science.
Table of Contents
Before we get into the data science, let's go over some essential technology we'll be working with. This whole project was created in a Jupyter notebook. Jupyter supports many different languages, but we're going to use the one you've probably heard the most about: Python 3.
Python has several libraries that help us immensely. A list has been provided below, but don't feel like you have to read through every link. Understanding every piece of tech isn't essential.
We use more than just the four libraries above, but they are the most essential.
Now onto the science! The first step of the Data Science process is data collection: we're going to need some data to analyze. At this point, it's important to consider what question we are asking. We'll start off with something simple: how do the most common COVID-19 vaccines affect the number and severity of cases?
In order to answer this question, we're going to grab some data from the WHO (World Health Organization). Click here to see a nice table the WHO has provided on COVID-19 around the world. We're going to download some of the CSVs the WHO used to create this data.
Please note: you can download the CSVs yourself and follow along in your own notebook, or you can follow this link to our Google Drive and run this notebook yourself! Everything's already there: the CSVs are downloaded and all the code is written. We heavily recommend this option.
Let's begin by importing the necessary libraries.
import datetime
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from folium.plugins import TimestampedGeoJson
from scipy import stats
from scipy.stats import norm
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
Next, we're going to get the data out of the CSVs and into a pandas DataFrame.
cases_deaths_df = pd.read_csv('https://raw.githubusercontent.com/EthanXuXu/ethanxuxu.github.io/main/WHO-COVID-19-global-table-data.csv', delimiter=",")
cases_deaths_df.head()
| Name | WHO Region | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | Deaths - newly reported in last 24 hours | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Global | NaN | 199466211 | 2559.050333 | 4158340 | 53.349393 | 548167 | 4244541 | 54.455309 | 64036 | 0.821549 | 8430 |
| 1 | United States of America | Americas | 35010407 | 10577.080000 | 618994 | 187.010000 | 78722 | 609022 | 183.990000 | 2759 | 0.830000 | 387 |
| 2 | India | South-East Asia | 31769132 | 2302.100000 | 284527 | 20.620000 | 42625 | 425757 | 30.850000 | 3735 | 0.270000 | 562 |
| 3 | Brazil | Americas | 19953501 | 9387.260000 | 245839 | 115.660000 | 15143 | 557223 | 262.150000 | 6721 | 3.160000 | 389 |
| 4 | Russian Federation | Europe | 6356784 | 4355.920000 | 161552 | 110.700000 | 22589 | 161715 | 110.810000 | 5537 | 3.790000 | 790 |
vax_df = pd.read_csv('https://covid19.who.int/who-data/vaccination-data.csv', delimiter=",")
vax_df.head()
| COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | FLK | AMRO | OWID | 2021-04-14 | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | AstraZeneca - AZD1222 | NaN | 1.0 |
| 1 | Saint Helena | SHN | AFRO | OWID | 2021-05-05 | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | AstraZeneca - AZD1222 | NaN | 1.0 |
| 2 | Faroe Islands | FRO | EURO | OWID | 2021-11-05 | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | Moderna - mRNA-1273, Pfizer BioNTech - Comirnaty | NaN | 2.0 |
| 3 | Greenland | GRL | EURO | OWID | 2021-12-10 | 78120 | 40351.0 | 137.603 | 71.076 | 37769.0 | 66.528 | Moderna - mRNA-1273 | NaN | 1.0 |
| 4 | Gibraltar | GIB | EURO | OWID | 2021-12-11 | 103818 | 41075.0 | 308.148 | 121.917 | 39907.0 | 118.450 | Pfizer BioNTech - Comirnaty | NaN | 1.0 |
We've just collected a lot of data from the internet, but it's a bit messy: it's missing values, there are variables we don't need, and it's hard to read and difficult to work with when we're coding. We've reached the second part of the data science process: cleaning the data.
Let's go ahead and drop the variables we don't need.
# Drop the columns we won't be using
vax_df.drop(["ISO3", "WHO_REGION", "DATA_SOURCE", "DATE_UPDATED"], axis=1, inplace=True)
cases_deaths_df.drop(["WHO Region",
"Deaths - newly reported in last 24 hours"], axis=1, inplace=True)
We've dropped what we don't need. Let's merge these tables together so all the data is in one place.
# JOIN TABLES BASED ON THE NAME OF THE COUNTRY
cases_deaths_df.rename(columns={"Name":"COUNTRY"}, inplace=True)
merged_df = pd.merge(vax_df, cases_deaths_df, on="COUNTRY")
merged_df.head()
| COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | AstraZeneca - AZD1222 | NaN | 1.0 | 61 | 1751.36 | 1 | 28.71 | 0 | 0 | 0.00 | 0 | 0.00 |
| 1 | Saint Helena | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | AstraZeneca - AZD1222 | NaN | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 |
| 2 | Faroe Islands | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | Moderna - mRNA-1273, Pfizer BioNTech - Comirnaty | NaN | 2.0 | 987 | 2019.85 | 8 | 16.37 | 2 | 2 | 4.09 | 1 | 2.05 |
| 3 | Greenland | 78120 | 40351.0 | 137.603 | 71.076 | 37769.0 | 66.528 | Moderna - mRNA-1273 | NaN | 1.0 | 122 | 214.89 | 13 | 22.90 | 0 | 0 | 0.00 | 0 | 0.00 |
| 4 | Gibraltar | 103818 | 41075.0 | 308.148 | 121.917 | 39907.0 | 118.450 | Pfizer BioNTech - Comirnaty | NaN | 1.0 | 5000 | 14840.76 | 178 | 528.33 | 20 | 94 | 279.01 | 0 | 0.00 |
One of those columns, VACCINES_USED, exhibits a common mistake: it packs several variables into a single column. Let's spread it out into indicator columns we can actually use as numbers.
merged_df.dropna(subset=['VACCINES_USED'], inplace=True)

# One 0/1 indicator column per vaccine; note that Sinopharm appears
# in VACCINES_USED under the name "Beijing"
vaccine_keywords = {
    "AstraZeneca": "AstraZeneca",
    "Pfizer": "Pfizer",
    "Moderna": "Moderna",
    "Janssen": "Janssen",
    "Sinopharm": "Beijing",
    "Covishield": "Covishield",
}
for column, keyword in vaccine_keywords.items():
    merged_df[column] = merged_df["VACCINES_USED"].str.contains(keyword).astype(int)

merged_df.drop(["VACCINES_USED"], axis=1, inplace=True)
merged_df.head()
| COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | AstraZeneca | Pfizer | Moderna | Janssen | Sinopharm | Covishield | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | NaN | 1.0 | 61 | 1751.36 | 1 | 28.71 | 0 | 0 | 0.00 | 0 | 0.00 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Saint Helena | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | NaN | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Faroe Islands | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | NaN | 2.0 | 987 | 2019.85 | 8 | 16.37 | 2 | 2 | 4.09 | 1 | 2.05 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | Greenland | 78120 | 40351.0 | 137.603 | 71.076 | 37769.0 | 66.528 | NaN | 1.0 | 122 | 214.89 | 13 | 22.90 | 0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Gibraltar | 103818 | 41075.0 | 308.148 | 121.917 | 39907.0 | 118.450 | NaN | 1.0 | 5000 | 14840.76 | 178 | 528.33 | 20 | 94 | 279.01 | 0 | 0.00 | 0 | 1 | 0 | 0 | 0 | 0 |
Much nicer! Now we can use the type of vaccine a country administers to help build our model.
Next we have to deal with missing data. This is a problem all data scientists face. We can take the simple route and remove the rows with missing data, or we can turn to imputation: the practice of replacing missing data with estimated values based on the rest of the table. The different techniques can get a little complicated, so we won't go over all of them. First, let's get an idea of how much data is missing.
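Before we count anything, here's the contrast between the two simplest routes in miniature. This is a made-up toy frame, not our WHO data; it just shows the mechanics of dropping rows versus mean imputation:

```python
import pandas as pd

# Hypothetical toy frame with a hole in each column
df = pd.DataFrame({"rate": [1.0, None, 3.0, 4.0],
                   "deaths": [10, 20, None, 40]})

dropped = df.dropna()            # route 1: discard incomplete rows
imputed = df.fillna(df.mean())   # route 2: fill each hole with its column mean

print(len(dropped))                # 2 -- half the rows are gone
print(imputed.isna().sum().sum())  # 0 -- every hole filled (the rate hole becomes ~2.67)
```

Dropping is safe but wasteful; imputation keeps every row at the cost of inventing plausible values.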
# Calculate how many rows have missing data. It should be a very small amount!
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 3 0 3 3 3 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There are 222 rows in our dataset. Above you can see how many values are missing in each column: for example, the first column has 0 values missing and the third column has 3 values missing. One column, "FIRST_VACCINE_DATE", has 16 values missing. All in all, not much is missing.
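As an aside, the per-column loop above can be written in one call: pandas' `isna().sum()` returns the same counts as a labeled Series. A quick sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3],
                   "b": [None, None, 6],
                   "c": [7, 8, 9]})

missing = df.isna().sum()   # Series of missing-value counts, indexed by column name
print(missing)              # a -> 1, b -> 2, c -> 0
```

The labeled output makes it much easier to see which column each count belongs to.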
Now, a lot of the nations missing data on when the first vaccine was administered are very small, often isolated countries that aren't that useful to our dataset. Let's go ahead and drop the nations this applies to.
merged_df.dropna(subset=['FIRST_VACCINE_DATE'], inplace=True)
merged_df.head()
| COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | AstraZeneca | Pfizer | Moderna | Janssen | Sinopharm | Covishield | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Philippines | 90259621 | 56110301.0 | 82.368 | 51.204 | 37353438.0 | 34.087 | 2021-03-01 | 10.0 | 1612541 | 1471.55 | 50352 | 45.95 | 6879 | 28141 | 25.68 | 823 | 0.75 | 1 | 1 | 1 | 1 | 1 | 0 |
| 10 | Nigeria | 12214538 | 8181113.0 | 5.925 | 3.969 | 4031004.0 | 1.955 | 2021-03-05 | 1.0 | 175264 | 85.02 | 3536 | 1.72 | 505 | 2163 | 1.05 | 29 | 0.01 | 0 | 0 | 0 | 0 | 0 | 1 |
| 11 | Oman | 5271912 | 2962504.0 | 103.237 | 58.013 | 2309408.0 | 45.224 | 2020-12-29 | 5.0 | 297431 | 5824.41 | 2414 | 47.27 | 309 | 3877 | 75.92 | 89 | 1.74 | 1 | 1 | 0 | 0 | 0 | 1 |
| 12 | Curaçao | 199029 | 103299.0 | 121.290 | 62.951 | 95730.0 | 58.339 | 2021-02-24 | 3.0 | 13692 | 8344.05 | 392 | 238.89 | 23 | 127 | 77.40 | 1 | 0.61 | 1 | 1 | 1 | 0 | 0 | 0 |
| 13 | Niue | 2352 | 1202.0 | 145.365 | 74.289 | 1150.0 | 71.075 | 2021-06-08 | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0 | 1 | 0 | 0 | 0 | 0 |
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 2 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
We still have some missing data. It looks like the missing values are in the columns "PERSONS_VACCINATED_1PLUS_DOSE", "PERSONS_VACCINATED_1PLUS_DOSE_PER100", "PERSONS_FULLY_VACCINATED", and "PERSONS_FULLY_VACCINATED_PER100". We'd like to fill those in, so let's do a hot deck imputation. Hot deck imputation fills in missing values from the observations that are most similar to the ones missing data. Basically, for every row with missing data, we're going to find the most similar observation and copy over its values.
How do we measure similarity? Well, it appears we're trying to impute data related to vaccination rates, so to evaluate similarity, we're going to compare values of total vaccinations per 100 people.
def similar_vax_rate(index):
    """Return the most similar complete row, judged by total vaccinations per 100."""
    most_similar_row = None
    min_difference = sys.float_info.max
    for i, row in merged_df.iterrows():
        # Only complete rows can donate values
        if index != i and not row.isnull().values.any():
            new_difference = abs(merged_df.loc[i, "TOTAL_VACCINATIONS_PER100"] - merged_df.loc[index, "TOTAL_VACCINATIONS_PER100"])
            if new_difference < min_difference:
                min_difference = new_difference
                most_similar_row = row
    return most_similar_row
for index, row in merged_df.iterrows():
    if row.isnull().values.any():
        imputation_row = similar_vax_rate(index)
        merged_df.at[index, "PERSONS_VACCINATED_1PLUS_DOSE"] = imputation_row["PERSONS_VACCINATED_1PLUS_DOSE"]
        merged_df.at[index, "PERSONS_VACCINATED_1PLUS_DOSE_PER100"] = imputation_row["PERSONS_VACCINATED_1PLUS_DOSE_PER100"]
        merged_df.at[index, "PERSONS_FULLY_VACCINATED"] = imputation_row["PERSONS_FULLY_VACCINATED"]
        merged_df.at[index, "PERSONS_FULLY_VACCINATED_PER100"] = imputation_row["PERSONS_FULLY_VACCINATED_PER100"]
Data has been imputed!
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Look! No more missing values. Let's start doing some real analysis...
It's time for some "exploratory analysis" and visualizations. We're going to look for patterns in the data, view relationships between variables, analyze the skew, and maybe transform the dataset.
Expect to see a lot of graphs!
We're going to be using Seaborn and matplotlib for much of the graphing.
First, let's start with a simple bar graph. We'll try and get a sense of how the pandemic has affected the world. Let's make a graph documenting the 30 countries with the highest cumulative deaths.
bar_df = merged_df.sort_values(["Deaths - cumulative total"], ascending=False)
bar_df = bar_df.iloc[:30,:]
plt.figure(figsize = (300, 120))
plt.title('Total Deaths per Nation')
sns.set(font_scale=10)
deaths_hist = sns.barplot(x = 'COUNTRY', y = "Deaths - cumulative total", data = bar_df)
deaths_hist.set_xticklabels(deaths_hist.get_xticklabels(), rotation=40, ha="right");
deaths_hist.set(xlabel='Countries', ylabel='Total cumulative deaths');
The countries you see above lost the most. Of course, this list mostly includes the nations with larger populations, but COVID-19 is a global problem, and we believe the scale of the tragedy should be considered globally. Worldwide, millions of people have died because of the pandemic.
Let's see how the world has fought this tragedy. Let's look at the countries which have vaccinated the most.
bar_df = merged_df.sort_values(["TOTAL_VACCINATIONS"], ascending=False)
bar_df = bar_df.iloc[:30,:]
plt.figure(figsize = (300, 100))
plt.title('Total Vaccinations')
sns.set(font_scale=10)
vax_bar = sns.barplot(x = 'COUNTRY', y = "TOTAL_VACCINATIONS", data = bar_df)
vax_bar.set_xticklabels(vax_bar.get_xticklabels(), rotation=40, ha="right");
vax_bar.set(xlabel='Countries', ylabel='Total Vaccinations (Billions)');
Above we can see that billions of vaccine doses have been administered around the globe. Clearly, the world has made a great effort. One interesting thing to note is that the countries with the highest cumulative deaths also tend to appear on the list of countries with the highest total vaccinations.
This data is interesting to observe, but it's time for real analysis. We're going to have to explore relationships between variables in order to create a model.
Let's look at how vaccinations rates correlate with recent cases.
# Scatterplot: fully vaccinated per 100 people vs. cases newly reported in last 7 days per 100,000 population
X = merged_df["PERSONS_FULLY_VACCINATED_PER100"]
Y = merged_df["Cases - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Cases')
sns.set(font_scale=1)
res = stats.linregress(X, Y)
graph = sns.scatterplot(x = "PERSONS_FULLY_VACCINATED_PER100", y = "Cases - newly reported in last 7 days per 100000 population", data=merged_df, s=10)
graph.set(xlabel='Fully Vaccinated per 100', ylabel='Cases in last 7 days per 100,000');
plt.plot(X, res.intercept + res.slope * X);
Hmm, the correlation between cases and fully vaccinated individuals appears to be positive: the more fully vaccinated individuals a country has, the more cases it reports.
This could have to do with population sizes.
Another thing we are aware of is that the COVID-19 vaccines are not 100% effective at preventing infection; it is still possible to get a breakthrough case.
We do know, however, that the vaccines reduce the risk of hospitalization and death.
Let's take a look at how the vaccine rate and death rates correlate.
# Scatterplot: fully vaccinated per 100 people vs. deaths newly reported in last 7 days per 100,000 population
plt.title('Vaccinations Versus Recent Deaths')
sns.set(font_scale=1)
res = stats.linregress(merged_df["PERSONS_FULLY_VACCINATED_PER100"], merged_df["Deaths - newly reported in last 7 days per 100000 population"])
graph = sns.scatterplot(x = "PERSONS_FULLY_VACCINATED_PER100", y = "Deaths - newly reported in last 7 days per 100000 population", data=merged_df, s=10)
graph.set(xlabel='Fully Vaccinated per 100', ylabel='Deaths in last 7 days per 100,000');
plt.plot(merged_df['PERSONS_FULLY_VACCINATED_PER100'], res.intercept + res.slope * merged_df['PERSONS_FULLY_VACCINATED_PER100']);
The correlation is still slightly positive, although the slope is definitely less steep.
This could be happening for a variety of reasons, such as population density.
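To put a number on the correlations we've been eyeballing, `scipy.stats.pearsonr` returns both a correlation coefficient and a p-value; calling it on our two real columns would quantify the scatterplots above. Here's a sketch on synthetic stand-in data (the numbers below are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)            # stand-in for a vaccination-rate column
y = 0.5 * x + rng.normal(0, 30, size=200)    # stand-in for a noisy, weakly related outcome

r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")  # a modestly positive r, echoing our plots
```

A coefficient near 0 paired with a large p-value would tell us the visual trend isn't statistically meaningful.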
At this point we want to use linear regression to obtain a predictive model of our data. Once our model is built, we can predict values that don't exist in our data. For instance, we can see how an even HIGHER vaccination rate would look for a country, or, on the other hand, what it would look like if a country was not vaccinated at all.
# Linear Regression
X = pd.DataFrame(merged_df['PERSONS_FULLY_VACCINATED_PER100'])
Y = pd.DataFrame(merged_df['Deaths - newly reported in last 7 days per 100000 population'])
model = LinearRegression()
scores = []
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    model.fit(X.iloc[train,:], Y.iloc[train,:])
    score = model.score(X.iloc[test,:], Y.iloc[test,:])
    scores.append(score)
print(scores)
[-0.0335432730139158, 0.004820643966362503, -0.33707316732710657]
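The cross-validated R² scores above are near zero or negative, meaning this single-feature model barely beats predicting the mean. Still, once a `LinearRegression` is fit, its `predict` method is how we'd answer the what-if questions posed above (a fully unvaccinated country, or a far more vaccinated one). A sketch with hypothetical numbers rather than our real fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical: deaths per 100k falling linearly as vaccination rises
rates = np.arange(0, 100, 5).reshape(-1, 1)   # fully vaccinated per 100
deaths = 8.0 - 0.06 * rates.ravel()           # made-up outcome

model = LinearRegression().fit(rates, deaths)
what_if = model.predict(np.array([[0.0], [90.0]]))
print(what_if)   # ~[8.0, 2.6]: predictions at 0 and 90 fully vaccinated per 100
```

On our real data, the same two-line `predict` call applies, but given the R² scores above its answers deserve heavy skepticism.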
X = merged_df['TOTAL_VACCINATIONS_PER100'].values.reshape(-1, 1)
Y = merged_df['Cases - newly reported in last 7 days per 100000 population'].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X, Y)

def grad_descent(X, Y, T, alpha):
    """Minimize 0.5 * ||X theta - Y||^2 with T steps of gradient descent."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    f = np.zeros(T)  # objective value at each step
    for i in range(T):
        residual = X.dot(theta) - Y
        f[i] = 0.5 * np.linalg.norm(residual) ** 2
        g = np.transpose(X).dot(residual)  # gradient with respect to theta
        theta = theta - alpha * g
    return theta, f
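To sanity-check that gradient descent on 0.5·||Xθ − y||² really lands on the least-squares solution, here's a self-contained sketch on synthetic data, compared against NumPy's closed-form solver (this is a test harness of our own, separate from the analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])  # intercept + one feature
y = X @ np.array([2.0, -3.0])                              # exact linear target

# Gradient descent: theta <- theta - alpha * X^T (X theta - y)
theta = np.zeros(2)
for _ in range(5000):
    theta -= 0.01 * X.T @ (X @ theta - y)

closed_form = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta, closed_form, atol=1e-6))  # True
```

The step size matters: too large and the iterates diverge, too small and 5000 steps won't be enough to converge.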
def runML(X, Y, xlabel, ylabel, title):
    # Split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    # Create a linear regression object
    model = linear_model.LinearRegression()
    # Train the model using the training set
    model.fit(X_train, Y_train)
    # Make predictions using the testing set
    Y_pred = model.predict(X_test)
    print('Ordinary Least Squares (OLS)')
    print('Coefficients: ', model.coef_[0])
    print('Intercept: ', model.intercept_)
    print('Mean squared error: %.2f'
          % mean_squared_error(Y_test, Y_pred))
    print('Coefficient of determination: %.2f'
          % r2_score(Y_test, Y_pred))
    plt.scatter(X, Y, color='black')
    plt.plot(X_test, Y_pred, color='blue', linewidth=3)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
def runMLRANSAC(X, Y, xlabel, ylabel, title, num):
    # Robustly fit a linear model with the RANSAC algorithm
    ransac = linear_model.RANSACRegressor()
    ransac.fit(X, Y)
    inlier_mask = ransac.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)
    Xin = X[inlier_mask]
    Yin = Y[inlier_mask]
    # Split the inliers into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(Xin, Yin, test_size=0.2)
    # Create a linear regression object
    model = linear_model.LinearRegression()
    # Train the model using the training set
    model.fit(X_train, Y_train)
    # Make predictions using the testing set
    Y_pred = model.predict(X_test)
    score = r2_score(Y_test, Y_pred)
    # Rerun RANSAC (up to 250 attempts) until the inlier fit is strong
    if ((score < 0.95 or score > 1) and num < 250):
        runMLRANSAC(X, Y, xlabel, ylabel, title, num+1)
    else:
        print('RANSAC')
        print('Coefficients: ', model.coef_[0])
        print('Intercept: ', model.intercept_)
        print('Mean squared error: %.2f'
              % mean_squared_error(Y_test, Y_pred))
        print('Coefficient of determination: %.2f'
              % r2_score(Y_test, Y_pred))
        plt.scatter(X[inlier_mask], Y[inlier_mask], color='green', marker='.', s=200,
                    label='Inliers')
        plt.scatter(X[outlier_mask], Y[outlier_mask], color='red', marker='.', s=200,
                    label='Outliers')
        plt.plot(X_test, Y_pred, color='blue', linewidth=3, label='Line of Best Fit')
        plt.legend(loc='lower right')
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.title(title)
        plt.show()
X = merged_df['PERSONS_FULLY_VACCINATED_PER100'].values.reshape(-1, 1)
Y = merged_df['Cases - newly reported in last 7 days per 100000 population'].values.reshape(-1, 1)
runML(X, Y, 'Fully Vaccinated per 100', 'Cases newly reported in last 7 days per 100,000', 'Vaccine Rate vs Cases')
runMLRANSAC(X, Y, 'Fully Vaccinated per 100', 'Cases newly reported in last 7 days per 100,000', 'Vaccine Rate vs Cases', 0)
Ordinary Least Squares (OLS)
Coefficients:  [1.34963927]
Intercept:  [25.50009582]
Mean squared error: 8071.07
Coefficient of determination: 0.04

RANSAC
Coefficients:  [0.51189316]
Intercept:  [2.00131908]
Mean squared error: 200.22
Coefficient of determination: 0.27
The data and analysis point toward a few conclusions:
1.) At the country level, vaccination rates correlate weakly positively with recently reported cases, a relationship likely confounded by factors such as population size and density rather than evidence against the vaccines.
2.) The correlation between vaccination rates and recent deaths is noticeably weaker, consistent with the claim that the vaccines reduce the severity of cases even when breakthrough infections occur.
3.) Our linear models explain very little of the variance (the cross-validated R² scores were near zero or negative, and even RANSAC only reached 0.27 on its inliers), so country-level aggregates alone can't support strong causal claims; richer, time-resolved data would be needed.